Abstract

The  recent  success  of  deep  learning  is  driving  a  trend 
towards  structurally  complex  computer  vision  models 
that  combine  feature  extraction  with  predictive  ele¬ 
ments  into  integrated  pipelines.  While  some  of  these 
models  have  achieved  breakthrough  results  in  applica¬ 
tions  like  object  recognition,  they  are  difficult  to  de¬ 
sign  and  tune,  impeding  progress.  We  feel  that  vi¬ 
sual  analysis  can  be  a  powerful  tool  to  aid  iterative 
development  of  deep  model  pipelines.  Building  on  fea¬ 
ture  evaluation  work  in  the  computer  vision  commu¬ 
nity,  we  introduce  ML-O-SCOPE,  an  interactive  visual¬ 
ization  system  for  exploratory  analysis  of  convolutional 
neural  networks,  a  prominent  type  of  pipelined  model. 
We present ML-O-SCOPE’s time-lapse engine that provides
views into model dynamics during training, and
evaluate  the  system  as  a  support  for  tuning  large  scale 
object-classification  pipelines. 


1  Introduction 

A  new  generation  of  pipelined  machine  learning  models 
is  achieving  significantly  higher  performance  than  older 
approaches  to  computer  vision  applications.  Thanks  to 
the  large  scale  of  online  activity  data,  and  to  novel  ways, 
like  crowd-sourcing,  of  collecting  it,  data  sets  of  un¬ 
precedented  size  and  depth  are  now  available  for  mod¬ 
eling  [5]  [16].  At  the  same  time,  hardware  advances 
and  specialized  software  implementations  that  take  ad¬ 
vantage  of  acceleration  [8]  [7]  and  distribution  [2]  have 
enabled  larger  models  to  be  trained  on  larger  data  sets. 
As  a  result,  models  are  growing  correspondingly  with 
data  sets  in  order  to  encode  the  richness  of  sample  pop¬ 
ulations  with  the  highest  possible  fidelity. 

The  growth  of  model  complexity,  however,  is  not  sim¬ 
ply  in  terms  of  sheer  size,  or  number  of  model  param¬ 
eters,  but  rather  in  terms  of  how  many  distinct  stages 
of processing a model composes. Successful large scale
models combine series of pipelined operators into a coherent
data flow. Such pipelined models treat an application
from end to end, including raw input normalization,
stages of feature extraction, and ultimately, prediction.

Artificial  neural  networks  are  a  classic  example  of  a 
pipeline,  with  each  layer  performing  a  function  and 
the  back-propagation  algorithm  providing  a  unified  ap¬ 
proach  to  training  [13].  The  momentum  behind  deep 
learning,  or  the  application  of  many-layered  convolu¬ 
tional  neural  networks  to  large  scale  learning  problems, 
has  proved  a  major  driver  of  pipelined  model  complex¬ 
ity.  The  recent  success  of  deep  learning  underscores  the 
importance  of  large,  composite  models  and  the  need  for 
tools  to  manage  their  complexity  [10]  [19]. 

New  tools  are  necessary,  because  large  models  present 
new  challenges  to  designers.  The  fundamental  challenge 
of  working  with  pipelined  models  is  to  decide  what  op¬ 
erators  to  include,  and  in  what  order,  to  maximize  pre¬ 
dictive  accuracy  and  avoid  over-fit.  Although  individual 
operators  have  well  defined  functions,  their  combined  ef¬ 
fect  can  be  difficult  to  predict  and  optimize.  Moreover, 
each  operator  often  has  associated  hyper-parameters 
that  must  be  tuned  for  peak  performance.  Because  up¬ 
stream  operators  affect  the  input  to  downstream  ones, 
decisions  about  components  and  their  parameters  can¬ 
not  be  made  in  isolation  from  one  another:  the  opti¬ 
mization  space  is  very  large.  In  the  case  of  large  con¬ 
volutional  neural  networks,  these  difficulties  have  made 
pipeline  design  impossible  for  all  but  a  small  number 
of  expert  practitioners  with  extensive  experience  in  the 
field. 

We  propose  visualization  as  a  means  to  address  the 
challenges  of  scale  and  complexity,  and  to  make  high- 
end  machine  learning  pipelines  approachable.  Visual¬ 
izations  of  different  pipeline  states  can  illustrate  why 
particular  configurations  succeed  or  fail,  and  at  what 
points  particular  designs  break  down.  These  visuals 
can  allow  non-experts  to  better  explore  and  understand 
the  internal  dynamics  of  pipelined  models,  and  gain  in¬ 
sights  into  what  works  from  just  a  few  model  instances. 
In this way, the optimization space can be navigated
quickly. 

To  demonstrate  the  utility  of  visualization  applied  to 
pipeline  tuning,  we  have  developed  ML-O-SCOPE,  an  in¬ 
teractive  visualization  system  for  analysis  of  deep  model 
pipelines.  Given  a  set  of  model  snapshots,  saved  during 
training,  an  input  corpus,  and  adaptor  code  to  query 
them,  ML-O-SCOPE  offers  navigation,  displays,  and  in¬ 
teractions  that  allow  users  to  explore  their  model  and 
its  relation  to  the  data. 

While  visualization  may  not  suit  all  types  of  input  data, 
we  focus  on  computer  vision  applications  where  data  is 
visual  by  nature.  The  modular  composition  of  pipelines 
facilitates inspection of intermediate data, i.e., transformations
of images as they pass through the model. In
certain  cases,  operators  themselves  are  visualizable.  To 
take  convolutional  operators  as  an  example,  we  can  vi¬ 
sualize  their  parameters  as  image  filters.  Recent  work 
has shown that visualization of these reconstructed intermediate
states can be an aid to model tuning [19] [17].

Automation  has  been  proposed  as  an  alternative  ap¬ 
proach  to  pipeline  tuning  [18],  but  only  for  modestly 
sized  models.  An  automated  tuning  algorithm  will  se¬ 
lect  a  series  of  parameter  settings  and  train  and  evaluate 
a model for each setting, eventually returning the highest
performing one. But with models’ size and complexity
growing  into  the  tens  of  millions  or  even  the  billions 
of  parameters  [2],  it  can  take  days  or  even  weeks  to 
train  a  single  instance  of  a  top-performing  model.  At 
this  scale,  training  time  is  too  great,  and  the  search 
space  too  large,  for  automated  tuning  to  have  much  im¬ 
pact  without  expert  guidance  in  the  context  of  a  defined 
workflow. 

ML-O-SCOPE  targets  an  iterative  workflow  for  develop¬ 
ment  and  refinement  of  pipelined  models.  At  a  given 
stage  in  the  design  process,  a  user  trains  a  pipelined 
architecture  and  saves  regular  checkpoints  of  its  state 
during  the  training  process.  The  user  then  uses  ML- 
O-SCOPE  to  inspect  individual  pipeline  stages;  ana¬ 
lyze  properties  of  the  training  process,  like  convergence 
rates;  and  diagnose  weaknesses.  These  observations 
lead  to  revisions  to  the  model  architecture  and  further 
rounds  of  training,  visualization,  and  assessment. 

To  evaluate  its  usefulness,  we  apply  ML-O-SCOPE  to 
several  pipelines  for  visual  object  classification,  trained 
on  the  CIFAR-10  and  ILSVRC  2012  (ImageNet)  data 
sets.  We  find  that  the  system  is  a  powerful  tool  for  ex¬ 
ploratory  analysis  of  the  tested  models,  and  in  practice, 
allows  users  to  find,  diagnose,  and  act  on  interesting 
properties  of  these  complex  models. 


2  Related  Work 

Recent  work  by  Zeiler  and  Fergus  [19]  demonstrates  the 
use  of  visualization  to  analyze  deep  convolutional  neu¬ 
ral  networks.  They  apply  a  technique  called  deconvo¬ 
lution  [20]  to  construct  illuminating  visual  representa¬ 
tions  of  individual  points  in  the  model.  In  short,  they 
highlight  regions  of  a  sample  image — from  simple  pat¬ 
terns  to  complex  objects  like  faces — that  maximize  the 
output  of  part  of  the  model.  While  these  visualiza¬ 
tions  are  compelling,  the  authors  find  direct  views  of 
parameters  better  suited  to  model  improvement.  They 
use  visualizations  of  filters  from  the  model  design  of 
Krizhevsky,  Sutskever,  and  Hinton  [10]  to  adjust  two 
hyper-parameters  and  consequently  boost  performance 
significantly.  With  ML-O-SCOPE,  we  aim  to  extend 
their  work  on  visual  parameter  tuning  with  an  interac¬ 
tive  system  and  by  adding  the  time  dimension  to  anal¬ 
ysis. 

Visualization  has  been  used  to  explicate  convolutional 
neural networks since some of the earliest implementations.
LeNet [12], a system for handwritten digit recognition,
and one of the first to achieve near-human
accuracy,  notably  includes  an  interactive  visualization 
system  to  display  predictions  and  features  extracted 
from  input  images.  The  visualizations  allow  direct  and 
compelling  demonstration  of  important  properties  of  the 
system  like  invariance  to  translations  and  deformations 
of  the  input.  While  LeNet’s  visualizations  provide  evi¬ 
dence  for  the  system’s  merits,  they  do  not  serve  as  de¬ 
sign  aids  to  practitioners. 

Beyond  neural  networks,  visualization  has  been  used  in 
computer vision more generally as a tool to aid in feature
evaluation. In [17], Vondrick, Khosla, and Malisiewicz
argue for the necessity of visual inspection of
image  features  to  understand  models’  failures.  They 
use  feature  inversion  algorithms,  whereby  image  fea¬ 
tures  are  transformed  back  into  the  original,  human- 
comprehensible  image-space,  thereby  giving  intuitive 
access  to  abstract  features.  Le,  et  al.  [11]  perform  in¬ 
verse  optimization  on  a  model  trained  by  unsupervised 
learning  to  construct  the  optimal  inputs  for  single  pa¬ 
rameters.  In  particular,  they  find  single  deep  neurons 
trained  to  respond  to  faces  (both  human  and  feline) 
and  bodies.  ML-O-SCOPE  incorporates  similar  feature 
visualizations  together  with  an  interactive  interface  to 
enable  exploration  and  analysis  of  individual  vision  op¬ 
erators  in  the  context  of  a  complete  pipeline. 

Much  recent  work  explores  the  growing  design  space 
of  machine  learning  pipelines,  and  convolutional  neu¬ 
ral  networks  in  particular.  Jarrett,  et  al.  [6]  evaluate 
architectural variations of different hand-designed networks
on several data sets. Others, like Yamins, Tax,
and Bergstra [18], use Bayesian methods to automatically
search the parameter space of convolutional networks.
ML-O-SCOPE intends to supplement such efforts
by helping users develop heuristics to guide search and
evaluation in this increasingly complex space.

Figure 1. The ML-O-SCOPE user interface.


3  Background 

We  begin  by  defining  machine  learning  pipelines  and  ex¬ 
ploring  the  design  space  of  one  particularly  important 
example:  deep  convolutional  neural  networks.  Briefly, 
we  will  highlight  how  major  components  and  opera¬ 
tors  in  these  systems  work,  and  how  a  whole  pipeline 
is  trained.  While  these  networks  can  be  trained  for  un¬ 
supervised  learning  tasks,  we  focus  on  the  supervised 
case. 

3.1  Pipelined  Models 

To  apply  machine  learning  to  a  problem  usually  requires 
two  steps.  The  first  is  to  identify  and  possibly  engineer 
a  good  set  of  predictive  features  from  raw  data  and  the 
second is to train a model on these features. Formal
machine learning focuses mainly on the second problem,
and  provides  a  variety  of  techniques  and  guidance  to 
solve  it.  Feature  extraction,  however,  remains  an  ad  hoc 
process  that  depends  greatly  on  the  problem  domain. 
For  applications  like  computer  vision  where  the  best 
inputs to a classifier are not at all obvious (how does
one get from a bitmap of pixels to a catalog of objects?),
improving  model  performance  consists  almost  entirely 
in  improving  the  features  fed  in. 

Pipelined  models  seek  to  couple  feature  extraction  with 
prediction  components  so  that  they  can  be  co-designed 
and  optimized.  A  pipelined  model  is  a  series  of  op¬ 
erators  that  first  preprocess  raw  data,  then  extract  fea¬ 
tures,  and  finally  use  those  features  to  make  predictions. 
Because  operators  are  modular  and  have  uniform  data 
flow  interfaces,  a  pipeline  framework  allows  easy  exper¬ 
imentation  with  overall  architecture.  For  example,  one 
operator  can  be  directly  substituted  for  another,  or  a 
series  of  operators  could  be  rearranged. 
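
The modularity described above can be sketched in a few lines of Python. This is only an illustration, not code from ML-O-SCOPE: the three operators (`normalize`, `differences`, `threshold_predict`) and the `run_pipeline` helper are hypothetical names, and the "image" is reduced to a single row of pixel values to keep the example short.

```python
from typing import Callable, List, Sequence

# Assumption: every operator shares one interface (list of floats in, list of floats out).
Operator = Callable[[List[float]], List[float]]


def normalize(pixels: List[float]) -> List[float]:
    """Preprocessing: rescale raw values to the range [0, 1]."""
    lo, hi = min(pixels), max(pixels)
    span = (hi - lo) or 1.0  # avoid dividing by zero on a flat input
    return [(p - lo) / span for p in pixels]


def differences(pixels: List[float]) -> List[float]:
    """Feature extraction: absolute difference between neighbours (a crude edge feature)."""
    return [abs(b - a) for a, b in zip(pixels, pixels[1:])]


def threshold_predict(features: List[float]) -> List[float]:
    """Prediction: 1.0 if any strong edge is present, else 0.0."""
    return [1.0 if max(features) > 0.5 else 0.0]


def run_pipeline(operators: Sequence[Operator], data: List[float]) -> List[float]:
    """Feed the output of each operator into the next one."""
    for op in operators:
        data = op(data)
    return data


pipeline = [normalize, differences, threshold_predict]
has_edge = run_pipeline(pipeline, [10, 10, 200, 200])
```

Because the operators share one interface, substituting or reordering them only means editing the `pipeline` list.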

In  designing  ML-O-SCOPE,  we  focus  on  deep  convolu¬ 
tional  neural  networks.  As  pipelines,  this  class  of  model 
architecture  uses  different  compositions  of  convolutional 
and  other  image  processing  operators  for  feature  extrac¬ 
tion,  followed  by  typical  neural  network  classification 
structures.  Such  pipelines  add  the  ability  to  train  the 
feature extraction components together with the predictive
components, via gradient back-propagation. In
certain  respects,  this  represents  a  way  to  automate, 
through  learning,  the  difficult  task  of  feature  engineer¬ 
ing. 


3.2  Convolutional  Neural 
Networks 

In  general,  an  artificial  neural  network  is  a  pipeline 
where  operators  are  described  as  layers  of  so-called  neu¬ 
rons.  A  neuron  computes  a  function  on  inputs  from 
the  preceding  layer  and  passes  the  result,  sometimes 
called the neuron’s activation, to outputs in the succeeding
layer. Within each layer, all neurons compute
the  same  function,  but  individual  neurons  may  have  dis¬ 
tinct  sets  of  inputs  and  outputs  and  may  assign  different 
weights  to  their  inputs.  Different  types  of  layers  are  de¬ 
fined  by  the  number  and  pattern  of  connections  between 
neurons,  and  the  functions  they  compute.  Successions 
of  fully  connected  layers,  where  neurons  receive  input 
from  every  output  in  the  preceding  layer,  function  as 
predictive units [15]. Convolutional layers’ neurons are
connected only to a local neighborhood of outputs from a
preceding layer in such a way that they compute the
convolution of an input “image” with a filter. We describe
convolution in greater detail below. Other types
of  layers  may  perform  other  types  of  data  and  image 
processing  including  contrast  normalization  and  sam¬ 
pling. 

As  described  above,  a  complete  network  architecture  is 
a  pipelined  series  of  feature  extraction  layers,  like  convo¬ 
lutions  and  down-sampling,  followed  by  predictive  lay¬ 
ers.  When  applied  to  object  classification,  the  output 
of  a  pipeline  will  be  a  vector  of  probabilities  predict¬ 
ing  to  which  class  an  input  image  belongs.  This  output 
can  be  used  by  an  optimization  algorithm,  like  gradient 
descent, to update the pipeline and reduce error.
Back-propagation is an algorithm that allows this optimization
process to be applied to all the layers in the network,
including those involved in feature extraction.
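
As a rough illustration of the last stage of this process, the sketch below takes one gradient descent step on a single linear layer with a softmax output and a cross-entropy loss. The loss, the layer shape, and the names `softmax` and `sgd_step` are assumptions made for the example; the text does not fix them. Back-propagation would carry the same error signal further back through earlier layers by the chain rule, which is not shown here.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Turn raw class scores into a vector of probabilities."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights: List[List[float]], features: List[float]) -> List[float]:
    """Class probabilities from one weight vector per class."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, features)) for w in weights]
    return softmax(scores)


def sgd_step(weights: List[List[float]], features: List[float], label: int,
             step_size: float = 0.1) -> Tuple[List[List[float]], List[float]]:
    """One gradient descent step; returns the new weights and the old probabilities."""
    probs = predict(weights, features)
    new_weights = []
    for c, w in enumerate(weights):
        # Gradient of the cross-entropy loss with respect to the score of class c.
        err = probs[c] - (1.0 if c == label else 0.0)
        new_weights.append([w_i - step_size * err * x_i for w_i, x_i in zip(w, features)])
    return new_weights, probs
```

Calling `sgd_step` repeatedly on labeled examples moves probability mass toward the correct class.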


3.2.1  Convolution 

Since  many  of  the  visualizations  implemented  in  ML-O- 
SCOPE  relate  to  convolutional  operators,  we  give  a  brief 
review  of  convolution.  Convolution  applies  a  filter  to  an 
image to produce a new image. A filter is a k × k weight
matrix where k is an odd number (so that the matrix
has a center pixel). Pixels in the output image are produced
by placing the filter on top of the input image,
with its center aligned at the corresponding pixel, and
computing the dot product of the filter with the pixels
below it.


(Michael  Plotke  /  CC-BY-SA-3.0) 

Figure  2.  Image  convolution. 

In  effect,  the  convolution  moves  the  filter  across  the  im¬ 
age  and  replaces  each  pixel  with  some  filtered  combi¬ 
nation  of  its  neighbors.  In  fact,  convolutional  trans¬ 
formations  can  perform  various  useful  image  processing 
functions,  like  emphasizing  edges  and  computing  gra¬ 
dients  of  hue  and  value.  Moreover,  deep  successions 
of  convolutions  have  been  shown  to  produce  image  en¬ 
codings  that  are  favorable  for  classification,  owing  to 
emergent invariance to translation and deformation [1].
But  exactly  what  is  computed — and  its  usefulness  for 
classification — depends  on  the  filters  used,  and  therefore 
success  of  a  convolutional  network  depends  crucially  on 
choosing  good  filters. 
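
The operation described in this section can be written out directly. The following is a minimal, unoptimized sketch in plain Python, assuming a single-channel image stored as a list of rows and treating pixels outside the image as zero (the border handling is an assumption; the text does not specify it). As in the description above, the filter is applied as a dot product without being flipped.

```python
from typing import List

Image = List[List[float]]


def convolve2d(image: Image, kernel: Image) -> Image:
    """Apply a k x k filter to every pixel of a 2-D image (same-size output)."""
    k = len(kernel)
    assert k % 2 == 1 and all(len(row) == k for row in kernel), "filter must be k x k, k odd"
    r = k // 2  # distance from the center of the filter to its edge
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for i in range(k):
                for j in range(k):
                    yy, xx = y + i - r, x + j - r
                    # Assumption: pixels outside the image contribute zero.
                    if 0 <= yy < h and 0 <= xx < w:
                        total += kernel[i][j] * image[yy][xx]
            out[y][x] = total
    return out


# A simple filter that responds to vertical edges (dark on the left, bright on the right).
vertical_edge = [[-1.0, 0.0, 1.0],
                 [-1.0, 0.0, 1.0],
                 [-1.0, 0.0, 1.0]]
step_image = [[0.0, 0.0, 1.0, 1.0] for _ in range(3)]
edges = convolve2d(step_image, vertical_edge)
```

In `edges`, the middle row is largest at the two columns that straddle the dark-to-bright step, which is the edge-emphasizing behavior mentioned above.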

3.3  Pipeline  Design  Space 

Recent  success  in  image  classification  has  come  from 
going  deeper:  composing  pipelines  with  more  con¬ 
volutional  layers  and  more  filters  per  layer.  By 
learning  features  instead  of  engineering  them  directly, 
back-propagation  has  given  well-designed,  well-tuned 
pipelines  a  major  advantage  in  complex  domains  like 
vision. 

But  in  a  certain  respect,  the  promise  of  automatically 
learned  features  is  undercut  by  the  imposition  of  a  new 
challenge:  pipelines  are  complicated  entities  that  are 
difficult  to  design.  The  problem  shifts  from  engineer¬ 
ing  good  features  to  engineering  a  pipeline  capable  of 
learning  good  features. 

The  case  of  convolutional  neural  networks  is  illustrative 
of  the  difficulty  of  optimization.  Although  at  a  high 
level the design is straightforward (a sequence of convolutional
operators followed by a classifier), many details
need to be tuned. At the architectural level, the number
of convolutional layers must be determined. Additional
convolutions tend to improve model performance, but
at some point the marginal return of another layer is
outweighed by its added complexity. The number and
position of non-convolutional operators, both for feature
extraction (e.g., sampling and normalization) and
for prediction, must also be decided.

More  decisions  are  attendant  on  the  level  of  individual 
operators.  Convolutional  operators  have  no  shortage 
of hyper-parameters, including the number of filters in
them,  the  size  of  those  filters,  how  those  filters  con¬ 
nect  to  filters  in  the  layers  before  and  after,  and  so 
on.  Hyper-parameters  of  non-convolutional  layers  in¬ 
clude  sampling  ratios,  and  fully-connected  layer  sizes. 
Learning  parameters  like  gradient  descent  step  size,  reg¬ 
ularization  coefficients,  and  initial  model  weight  distri¬ 
butions  add  yet  more  dimensions  to  the  design  space 
that  must  be  tuned. 

Finally,  these  various  design  decisions  cannot  in  general 
be  made  in  isolation  from  one  another.  Properties  of  one 
operator will affect the behavior of others downstream
from  it.  Moreover,  large-scale  pipelined  models  are  run 
in  resource  constrained  settings.  For  example,  the  deep 
convolutional  architecture  of  Krizhevsky  et  al.  [10]  is  de¬ 
signed  to  saturate  a  specific  model  GPU.  In  this  regime, 
decisions  to  allocate  more  resources  to  one  operator, 
e.g., more filters in a convolutional layer, must be traded
against decreased performance elsewhere in the model.

All  of  these  factors  can  have  a  dramatic  impact  on  model 
performance  and  complexity.  By  offering  visual  tools 
to  analyze  the  effects  of  design  decisions,  ML-O-SCOPE 
enables  users  to  explore  the  design  space  without  blindly 
trying  all  possible  permutations. 


4  The  ML-O-SCOPE  System

We  implemented  the  ML-O-SCOPE  system  to  investi¬ 
gate  the  usefulness  of  visual  exploratory  analysis  ap¬ 
plied  to  convolutional  neural  network  pipeline  optimiza¬ 
tion.  ML-O-SCOPE  is  a  light-weight  web  application 
that  allows  users  to  visually  examine  saved  snapshots 
of  a  trained  model.  This  section  gives  an  overview  of 
ML-O-SCOPE’s  system  architecture  and  the  visual  and 
navigational  features  that  aid  model  exploration  and  di¬ 
agnosis. 

4.1  Representing  Models 

Most  features  of  ML-O-SCOPE  are  built  upon  a  core  ab¬ 
stract  data  model  of  convolutional  neural  networks.  The 
back-end  supports  three  classes  of  visualization:  views 
of model parameters, views of features (data transformed
by the model), and summary views. Parameter
and feature views, as well as navigational features,
access  data  and  meta-data  from  saved  model  instances 
through  this  core  abstraction.  Summary  views  are  sup¬ 
ported  by  a  separate  pre-computed  statistics  database 
described  below. 

The  data  model  for  queries  includes  model  checkpoints, 
layers,  and  model  parameters.  A  model  checkpoint  is  a 
complete  instance  of  the  model  at  some  point  during  the 
training  process,  typically  measured  in  epochs  or  itera¬ 
tions.  Model  instances  contain  a  set  of  layers,  defined  by 
their  architecture,  and  each  layer  contains  some  num¬ 
ber  of  parameters.  Often,  parameters  are  grouped  in  a 
natural  way,  as  is  the  case  with  filters  in  convolutional 
layers.  ML-O-SCOPE  stores  meta-data  about  connected 
models  that  describe  overall  pipeline  architecture  and 
details  about  each  layer. 

Different  implementations  of  convolutional  pipelines 
store  models  in  distinctive  formats  that  may  not  align 
with  ML-O-SCOPE’s  own  representation.  To  handle 
this  heterogeneity,  ML-O-SCOPE  provides  an  interface 
to  register  adaptor  code.  The  adaptor  abstraction 
consists  of  a  core  set  of  query  primitives  for  access¬ 
ing  checkpoints,  layers,  and  parameters,  as  well  as 
meta-data  about  pipeline  architecture.  With  adap¬ 
tors,  all  connected  model  instances  can  be  accessed 
through  the  same  abstract  data  layer.  We  have  imple¬ 
mented  adaptors  for  models  trained  by  decaf,  caffe, 
and  cuda-convnet,  each  of  fewer  than  100  lines  of 
python. 
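
To make the adaptor idea concrete, here is one way such an interface could look. This is a hypothetical sketch, not ML-O-SCOPE's actual API: the class names `ModelAdaptor` and `DictAdaptor`, the method names, and the helper `filter_history` are all invented for illustration, and the toy adaptor reads from an in-memory dictionary rather than from decaf, caffe, or cuda-convnet files.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

Filter = List[float]  # one parameter group, e.g. a flattened convolutional filter


class ModelAdaptor(ABC):
    """Hypothetical query primitives for checkpoints, layers, and parameters."""

    @abstractmethod
    def checkpoints(self) -> List[int]:
        """Training iterations at which a snapshot was saved, in order."""

    @abstractmethod
    def layers(self) -> List[str]:
        """Layer names in pipeline order (architecture meta-data)."""

    @abstractmethod
    def parameters(self, checkpoint: int, layer: str) -> List[Filter]:
        """Parameter groups of one layer at one checkpoint."""


class DictAdaptor(ModelAdaptor):
    """Toy adaptor over {checkpoint: {layer: [filters]}}; a real one would parse saved files."""

    def __init__(self, snapshots: Dict[int, Dict[str, List[Filter]]]):
        self._snapshots = snapshots

    def checkpoints(self) -> List[int]:
        return sorted(self._snapshots)

    def layers(self) -> List[str]:
        return list(self._snapshots[self.checkpoints()[0]])

    def parameters(self, checkpoint: int, layer: str) -> List[Filter]:
        return self._snapshots[checkpoint][layer]


def filter_history(adaptor: ModelAdaptor, layer: str, index: int) -> List[Filter]:
    """One filter at every saved checkpoint, i.e. a slice along the time axis."""
    return [adaptor.parameters(c, layer)[index] for c in adaptor.checkpoints()]
```

Code written against `ModelAdaptor`, such as `filter_history`, does not need to know which training system produced the snapshots.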

4.2  Supporting  Views  and 
Navigation 

Visualizations  of  model  parameters  can  be  built  directly 
from  the  results  of  model  queries.  Visualizations  of  in¬ 
termediate  feature  data,  on  the  other  hand,  require  the 
model  to  be  evaluated  on  some  input  data.  Our  model 
adaptor  interface  allows  us  to  import  a  model  check¬ 
point  into  decaf  [3],  a  python  native  implementation 
of  convolutional  neural  networks,  and  evaluate  it  on  de¬ 
mand. 

Views  of  feature  data  further  depend  on  access  to  a  col¬ 
lection  of  image  data  from  which  to  draw  examples.  Like 
model  instances,  data  sets  can  be  connected  to  ML-O- 
SCOPE  via  an  adaptor  interface.  Data  adaptors  imple¬ 
ment  access  to  individual  images  in  the  data  set,  and  can 
optionally  provide  meta-data  about  each  image.  Basic 
access  allows  users  to  find  random  images  to  test  against 
the model. File names, keywords, and class label meta-data
allow ML-O-SCOPE to give users a simple faceted
search interface to the data set.
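
A faceted search over image meta-data can be as simple as filtering on every requested field. The sketch below is an assumption about how this might work, not a description of ML-O-SCOPE's implementation; the field names (`file`, `label`, `keywords`) and the function names are hypothetical.

```python
from typing import Dict, List


def matches(image_meta: Dict[str, object], field: str, wanted: object) -> bool:
    """True if the image's meta-data satisfies one facet."""
    value = image_meta.get(field)
    if isinstance(value, (list, tuple, set)):
        return wanted in value  # multi-valued fields such as keywords
    return value == wanted


def faceted_search(images: List[Dict[str, object]], **facets: object) -> List[Dict[str, object]]:
    """Return the images that satisfy every requested facet."""
    return [img for img in images
            if all(matches(img, field, wanted) for field, wanted in facets.items())]


# Hypothetical meta-data records, as a data adaptor might supply them.
corpus = [
    {"file": "img_001.png", "label": "cat", "keywords": ["indoor", "dark"]},
    {"file": "img_002.png", "label": "dog", "keywords": ["outdoor"]},
    {"file": "img_003.png", "label": "cat", "keywords": ["outdoor"]},
]
dark_cats = faceted_search(corpus, label="cat", keywords="dark")
```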

Large  pipelined  models  can  contain  an  overwhelming 
number of parameters, too many to visually inspect
and analyze together at once. ML-O-SCOPE uses model
meta-data to provide users with a navigational interface
that allows reasonably sized chunks of the model to be
selected and viewed. Users can select which layer to look
at,  and  can  further  select  a  subset  of  parameters  (or  fil¬ 
ters,  in  the  convolutional  case)  in  that  layer.  In  addition 
to  these  layer-  and  parameter-axes,  ML-O-SCOPE  pro¬ 
vides  navigation  along  the  time-axis.  By  keeping  track 
of  model  snapshots,  the  system  can  build  parameter  and 
feature  views  based  on  any  checkpointed  moment  during 
training. 


4.3  Statistics  Engine 

Several of our visualizations require summary statistics
that are computed over a complete data set.
These  summaries  include  prediction  performance  mea¬ 
sures,  e.g.,  counts  for  building  confusion  matrices,  and 
analysis  of  output  probability  vectors  including  cluster¬ 
ing  and  indexing.  Since  generating  model  output  over  a 
full  data  set  can  be  time  consuming,  we  provide  an  en¬ 
gine  to  compute  statistics  in  batches  and  save  them  to 
a  database.  In  the  same  way  that  views  of  feature  data 
are  built  internally  with  decaf,  the  statistics  engine  can 
import  a  model  instance  through  the  standard  interface 
and  run  it  on  a  connected  data  set.  This  is  sufficient  for 
a  small  data  set  like  CIFAR-10,  but  at  ImageNet  scale 
it  often  makes  more  sense  to  run  data  through  each 
model  with  its  native  platform  (e.g.,  the  GPU  acceler¬ 
ated  caffe  system).  In  this  case,  the  statistics  engine 
can  take  raw  output  of  predictions  and  class  probabil¬ 
ities  directly.  At  run  time,  the  statistics  database  is 
queried  via  the  same  web  service  that  powers  our  filter 
visualizations. 

Currently,  the  statistics  engine  calculates  the  following 
statistics  over  the  corpus  at  each  time  step:  the  set  of 
class  probabilities  by  image;  counts  of  model  confusion, 
i.e.,  how  often  images  from  each  class  were  predicted 
to  belong  to  each  other  class;  an  index  of  images  by 
their  actual  and  predicted  classes;  and  a  set  of  clusters 
and  the  k-nearest  neighbors  of  those  clusters  that  are 
calculated  from  the  class  posterior  probability  vectors 
output  by  the  model.  Each  of  these  statistics  is  used  to 
drive  one  of  the  views  described  below. 
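
Two of these statistics, the confusion counts and the index of images by actual and predicted class, can be computed in one pass over the model output. The sketch below assumes integer class labels and one probability vector per image; the function name `confusion_counts` is hypothetical, and the database storage and clustering steps are omitted.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (actual class, predicted class)


def confusion_counts(true_labels: List[int],
                     class_probs: List[List[float]]) -> Tuple[Dict[Cell, int], Dict[Cell, List[int]]]:
    """Count and index images by (actual, predicted) class for one checkpoint."""
    counts: Dict[Cell, int] = defaultdict(int)
    index: Dict[Cell, List[int]] = defaultdict(list)
    for image_id, (actual, probs) in enumerate(zip(true_labels, class_probs)):
        # The predicted class is the one with the highest probability.
        predicted = max(range(len(probs)), key=probs.__getitem__)
        counts[(actual, predicted)] += 1
        index[(actual, predicted)].append(image_id)
    return dict(counts), dict(index)
```

Running this once per saved checkpoint yields the per-time-step counts that a confusion matrix view can display.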


4.4  Web  Application 

We  provide  access  to  the  model  query  interface,  visual¬ 
ization  generation,  and  statistics  engine  via  a  RESTful 
interface  backed  by  a  flask  (python)  application.  The 
client uses web requests to this interface to query for
model state, feature data, meta-data, and summary information.
Responses are returned as either bitmap image
data (PNG), vector graphics (SVG), or JSON. The
client  is  responsible  for  issuing  requests  and  handling  re¬ 
sponses  in  order  to  display  views  and  enable  interactive 
exploration.  Since  exploratory  analysis  often  involves 
issuing  many  related  queries,  the  server  makes  heavy 
use  of  caching  to  reduce  latency  when  the  same  object 
is  requested  multiple  times. 

ML-O-SCOPE  allows  all  parameter  and  feature-space 
views  to  be  animated,  so  that  users  can  watch  how  they 
evolve  over  the  course  of  training.  To  avoid  flickering  in 
the animation, which could contribute to change blindness [4]
and a diminished experience, the front-end uses
several optimizations to maintain responsiveness. Images
are pre-fetched by the browser and positioned off-screen
until they have loaded completely. With this
approach,  frame  updates  are  seamless  and  don’t  require 
a  round-trip  to  the  web  service. 

5  Visual  Analysis 

We  present  the  views  supported  by  ML-O-SCOPE  to  help 
model  builders  understand  their  convolutional  neural 
networks.  The  main  display  lets  users  interactively  ex¬ 
plore  different  components  of  a  network  and  view  its  in¬ 
ternal  structure  directly  and  via  features  extracted  from 
sample  data.  Additional  summary  views  are  available  in 
subsidiary  displays.  These  provide  supporting  informa¬ 
tion  to  help  the  user  assess  hypotheses  about  the  cause 
of  certain  types  of  errors  and  understand  the  interaction 
between  classes.  All  these  visualizations  help  users  to 
understand  how  the  model  changes  over  the  course  of 
training.  That  is,  they  provide  a  mode  of  exploration 
and  comparison  across  time  steps.  We  feel  that  this  is  an 
important  and  differentiating  characteristic  of  our  work 
that  may  enable  new  insights  into  the  model  training 
process. 

5.1  Main  View:  Filters  and 
Features 

Our  primary  display  is  a  time-lapse  view  of  model  devel¬ 
opment  for  a  particular  network  layer.  Visualizations  of 
both  the  layer’s  constituent  parameters,  and  of  the  fea¬ 
tures  produced  by  the  layer,  given  an  input  image,  are 
available  in  this  display  (see  Figure  1).  At  the  bottom  of 
the  window  are  timeline  controls  to  support  checkpoint 
selection  and  animation. 

All of our views update in response to the current timeline
value. Animated views allow users to see how the
model evolves over the course of training and to observe
how structure emerges. To take one example, the filters
shown at right in Figure 1, from the first convolutional
layer of a model, progress into Gabor filters [14], a well
studied type of convolutional feature extractor. They
can be interpreted as deformation-invariant edge detectors
[1], an effect that we can see in the visualized feature
data in the figure, at center. It is important to note
that the model converges on these filters automatically
as part of the training process.

Figure 3. ML-O-SCOPE primary display. (1) Filter details; (2) image selector; (3) network overview and
navigation; (4) filter visualization; (5) visualization selector; (6) selection helper; (7) animation progress slider.

Additional  elements  of  the  main  interface  are  high¬ 
lighted  in  Figure  3.  The  top  of  the  page  displays  an  in¬ 
teractive  graph  representation  of  the  model’s  pipelined 
architecture.  Users  can  interact  with  the  graph  to  navi¬ 
gate  the  network  and  display  meta-data.  Details  about 
the  currently  selected  layer  are  provided  in  the  upper- 
left  corner  of  the  display. 

A  search  interface  for  the  image  training  corpus  allows 
users  to  find  and  select  images  to  pass  to  the  model  and 
view.  Selected  images  are  displayed  in  the  feature  space 
of  the  currently  selected  layer  of  the  network,  so  users 
can  visualize  the  output  of  each  operator.  The  sidebar 
displays  a  histogram  of  the  model’s  predicted  classes  for 
the  selected  image. 


5.2  Summary  Views 

5.2.1  Confusion  Matrix 


Figure 4. Confusion matrix display.

The confusion matrix view, shown in Figure 4, helps
users to diagnose "hot-spots" of misclassified images in
their model. The matrix's rows correspond to true image
classes and its columns correspond to the model's
predicted classes. Each cell displays the number of images
from one class predicted to be in another (or, on
the diagonal, the same) class: for example, the number
of dog images predicted to contain cats. Shading is used
to emphasize cells with large counts so users can quickly
perceive troublesome classes. A perfect classifier would
produce a confusion matrix with zeros everywhere
but on the main diagonal, so off-diagonal shading
represents problems.
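
As a minimal sketch of how such a matrix can be tabulated, assuming integer class labels: the function names `confusion_matrix` and `off_diagonal_hotspots` and the toy labels below are illustrative and are not part of ML-O-SCOPE.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Count images per (true class, predicted class) pair.

    Rows index the true class and columns the predicted class,
    matching the layout described in the text.
    """
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        matrix[t, p] += 1
    return matrix

def off_diagonal_hotspots(matrix, top_k=3):
    """Return up to top_k (true, predicted, count) cells with the most errors."""
    errors = matrix.copy()
    np.fill_diagonal(errors, 0)  # ignore correct predictions
    flat_order = np.argsort(errors, axis=None)[::-1][:top_k]
    rows, cols = np.unravel_index(flat_order, errors.shape)
    return [(int(r), int(c), int(errors[r, c]))
            for r, c in zip(rows, cols) if errors[r, c] > 0]

# Toy example with hypothetical labels: 0 = cat, 1 = dog, 2 = bird.
true_labels = [0, 0, 1, 1, 1, 2, 2]
predicted_labels = [0, 1, 0, 0, 1, 2, 2]
cm = confusion_matrix(true_labels, predicted_labels, num_classes=3)
hotspots = off_diagonal_hotspots(cm)
```

Here the largest off-diagonal cell is the one a user would inspect first: dog images (row 1) predicted as cats (column 0).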

When the user mouses over an individual cell, the cell
expands to show a sample of images that fall into it. If
the misclassified images share common visual structure,
the user may choose to give special treatment to this
structure in a future version of their model. For example,
if dark pictures tend to be misclassified, the user
might choose to normalize input images before feeding
them into the network.

Like  the  main  filter  display,  the  confusion  matrix  view 
is  linked  to  the  timeline  slider  to  show  how  the  model 
evolves  over  time. 


5.3  Clustered  Images 

To further aid in the diagnosis of classification errors,
the clustered images view displays a set of sample images
clustered by their similarity in the raw pipeline output,
normally a vector of predicted class probabilities.
We cluster using K-Means with a Euclidean distance
metric. For each cluster, we display the closest images
to the cluster center. If a user wants to understand the
possible causes of a set of misclassified images, they can
inspect these clusters for anomalies like, say, a group of
images of far-away airplanes that look like birds. The
user may then adjust the parameters of their model to
better handle this case, for example, by increasing the
resolution of filters at an early layer.
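
The clustering step can be sketched as follows. This is not the ML-O-SCOPE implementation: it is a plain Lloyd's-algorithm K-Means in NumPy, with a deterministic farthest-first initialization chosen here only to make the example reproducible. The probability vectors and all function names are hypothetical.

```python
import numpy as np

def farthest_first_centers(points, k):
    """Deterministic initialization (an assumption of this sketch):
    start at point 0, then repeatedly add the point farthest from
    all centers chosen so far."""
    chosen = [0]
    min_dist = ((points - points[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        nxt = int(min_dist.argmax())
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, ((points - points[nxt]) ** 2).sum(axis=1))
    return points[chosen].copy()

def squared_distances(points, centers):
    """Squared Euclidean distance from every point to every center."""
    return ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

def kmeans(points, k, num_iters=20):
    """Lloyd's algorithm with a Euclidean distance metric."""
    centers = farthest_first_centers(points, k)
    for _ in range(num_iters):
        assignments = squared_distances(points, centers).argmin(axis=1)
        for j in range(k):
            members = points[assignments == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    dists = squared_distances(points, centers)
    return centers, dists.argmin(axis=1), dists

def closest_to_centers(dists, assignments, per_cluster=2):
    """For each cluster, indices of its member images nearest the center."""
    result = {}
    for j in range(dists.shape[1]):
        members = np.flatnonzero(assignments == j)
        ordered = members[np.argsort(dists[members, j])]
        result[j] = ordered[:per_cluster].tolist()
    return result

# Hypothetical class-probability vectors for six images over three classes.
probs = np.array([
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.10, 0.80, 0.10],
    [0.05, 0.90, 0.05],
    [0.05, 0.05, 0.90],
    [0.10, 0.10, 0.80],
])
centers, assignments, dists = kmeans(probs, k=3)
representatives = closest_to_centers(dists, assignments)
```

The `representatives` dictionary plays the role of the images shown for each cluster; in practice the input rows would be the pipeline's output vectors for the sampled images.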

Again, the time slider appears in this view to enable the
user to see how these clusters evolve as the output of the
fully connected layer changes at each model checkpoint.
To our knowledge, this is a novel approach for diagnosing
classification issues in the context of convolutional
neural networks.


5.4 Direct Comparison

In the direct comparison view, shown in Figure 5,
ML-O-SCOPE provides one more way to analyze changes in
the model made over the course of training. Users select
two points in time during training and can then display
visualizations of those two snapshots side by side. As in
the main display, users select which parts of the model
to view, and whether to view model parameters directly
or via extracted feature data. While the time-step
animations of the main display allow a user to explore the
incremental evolution of the model, this view highlights
major cumulative changes across distant steps.

We can already see some utility and insights with this
view. For example, filters that are initialized with
high-variance weights tend to retain high-variance weights
in the final model. The filters in the fifth row and last
row of the first convolutional layer all start with high
variance and remain high variance at the end. This
information can be used to inform approaches to model
initialization and regularization.

Figure 5. Direct comparison display.


6 Evaluation


To measure the usefulness of ML-O-SCOPE, we have
instrumented it to connect to models trained by any of
several systems, including Krizhevsky's cudaconvnet
[8], and Jia's caffe [7] and decaf [3]. Both
cudaconvnet and caffe use hardware acceleration to
allow the training of an ImageNet-scale model in a few
days on a single machine with a recent-generation GPU.
cudaconvnet required slight modification to save
intermediate model snapshots during training.


We used these systems to train models on two data sets.
CIFAR-10 [9] is a modestly sized collection of images
depicting ten classes of objects. It includes 60,000 images,
each 32 by 32 pixels, drawn from the 80 Million Tiny
Images data set [16], which consists of "in the wild" images
scraped from the web. Despite its small size, CIFAR-10's
origins make it a rich and challenging data set for
object classification.

ML-O-SCOPE has also been instrumented for ImageNet
2012 [5] data, and for models trained on it. This data, from
the ILSVRC 2012 challenge, consists of over one million
full-size images from web sources like Flickr.

6.1  Exploratory  Analysis 

ML-O-SCOPE  has  proved  useful  for  understanding  the 
performance  of  CIFAR-10  and  ImageNet  models.  The 
following  use  case  illustrates  the  power  of  interactive 
exploratory  analysis  applied  to  model  pipelines. 

With cudaconvnet, we trained on CIFAR-10 a convolutional
neural network architecture reported to achieve
good performance with relatively little training [8]. The
architecture consists of three stages of convolution and
down-sampling, followed by a fully-connected network
layer. Here the convolution and down-sampling operators
represent the feature extraction component of the
pipeline, and the fully-connected layer acts as a
"universal classifier" [15]. This pipeline takes about ten
epochs, or passes over the data, to train to convergence.
We took checkpoints of the model before and after each
epoch, and loaded these checkpoints into ML-O-SCOPE
for analysis.


Figure 6. Weights in the penultimate fully connected
layer of a CIFAR-10 model, as initialized
(left) and after 8 epochs of training (right). Lighter
pixels correspond to higher weights.

In exploring the development of model parameters in
each stage of the pipeline, we observed that the
visualization of the fully-connected operator remained static
as training progressed. Since we expect learning to
change the values of model parameters, which had been
initialized randomly, we also expected to see the
visualizations change. Suspecting a bug in our
implementation, we queried the model checkpoint files directly
and found that the parameters of the fully connected
layer had indeed remained essentially static during
training.
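
A direct check of this kind might look like the sketch below. It assumes each checkpoint has already been loaded into a dictionary mapping layer names to NumPy arrays; the loading step depends on the training system and is not shown. The layer names, the relative-change measure, and the 1% threshold are illustrative choices, not values from our experiments.

```python
import numpy as np

def relative_change(initial, current):
    """Norm of the parameter change, relative to the initial norm."""
    return float(np.linalg.norm(current - initial)
                 / (np.linalg.norm(initial) + 1e-12))

def static_layers(checkpoints, threshold=0.01):
    """Names of layers whose weights moved by less than `threshold`
    (relative) between the first and last checkpoint."""
    first, last = checkpoints[0], checkpoints[-1]
    return [name for name in first
            if relative_change(first[name], last[name]) < threshold]

# Synthetic stand-in for two checkpoints (hypothetical layer names).
rng = np.random.default_rng(0)
conv_init = rng.normal(size=(8, 3, 5, 5))
fc_init = rng.normal(size=(64, 10))
checkpoints = [
    {"conv1": conv_init, "fc": fc_init},
    # conv1 changes substantially; fc barely moves.
    {"conv1": conv_init + rng.normal(size=conv_init.shape),
     "fc": fc_init + 1e-6},
]
flagged = static_layers(checkpoints)
```

With real checkpoints, a layer that appears in `flagged` across the whole training run is a candidate for the kind of follow-up experiment described next.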

This observation inspired follow-up experiments. We
trained a slight variation of the original architecture
where the fully connected component was replaced with
multiplication by a random matrix. This is equivalent
to the original architecture with learning disabled in
that layer. The modified architecture had no loss of
predictive accuracy on the CIFAR-10 test set (both
architectures achieve about 25-26% error after 10 epochs),
despite having fewer than 6% as many learned model
parameters as the original. In principle, identifying
non-learned components of a pipelined architecture like this
could be exploited by software implementations to
reduce training time per iteration.

It should be noted that this observation was specific
to the model design and data set. For example, the
corresponding fully connected layers in the popular
ImageNet architecture of Krizhevsky et al. learn
significantly throughout training. This variability across
data sets and domains emphasizes the need for a tool
like ML-O-SCOPE to explore and diagnose new models
trained for new applications.

6.2  User  Study 

ML-O-SCOPE is intended as a tool to help machine
learning practitioners design and tune optimal pipelined
models in less time. We propose to directly measure
its applicability to this problem domain through a user
study comparing the system against the current
alternatives. The target user is an analyst or data scientist
who lacks intimate knowledge of model design and
composition (including of particular types of pipelines
like deep convolutional neural networks) but who has
practical familiarity with machine learning models and
how to construct them from data.

Our experimental design includes a small group (N = 20)
of target users in two subgroups, namely, academic
(students) and professional. To begin, participants are
given a brief introduction to the basics of pipelined
models, and of convolutional neural networks in particular,
including a short guide with suggestions and best
practices for tuning them. Participants should already be
familiar with standard concepts like cross-validation,
learning rates, regularization, and so on.

After the introduction, participants are given a series of
model-tuning tasks to complete. For each task, the user
is provided with a base architecture of a convolutional
neural network pipeline appropriate for the CIFAR-10
data set, and a set of sub-optimal base parameters. In
addition, users are given an interface where they can
modify parameter settings, and a mechanism to train
and evaluate a model given their current settings. The
goal of the task is to minimize model error against the
CIFAR-10 test set within a fixed budget of model
iterations. To avoid overwhelming users, the task is limited
to tuning a selected set of hyper-parameters (e.g., filter
sizes and filter counts per layer), and the pipeline
architectures remain fixed.

For some of the tasks, participants have access to
ML-O-SCOPE to review the models trained at each iteration.
Half of the participants have ML-O-SCOPE for the first
half of tasks, and the other half for the second half only.
In addition to the regular introduction, participants
receive a brief introduction to ML-O-SCOPE before the
tuning task where they are allowed to use it.

As a benchmark, we measure the performance of an
expert designer of convolutional neural networks on the
same tuning task, without the use of ML-O-SCOPE. The
two primary metrics are: first, the test accuracies of
participants' models after the budget of tuning iterations
is expended; and second, the number of tuning iterations it
takes participants' models to achieve near-expert level
test accuracy, as determined by the expert designer's
results. Participants' performance distributions, according
to both metrics, are compared for tasks completed
with and without the aid of ML-O-SCOPE. An important
secondary metric is the time taken to complete
each task. We expect that participants will achieve
higher accuracy with fewer iterations when using
ML-O-SCOPE, likely at the cost of taking more time per
iteration.
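
The two primary metrics could be computed along the following lines. This is only a sketch under stated assumptions: the study design does not fix a numeric definition of "near-expert", so the `tolerance` parameter is hypothetical, as are the accuracy values; the first metric is read here as the accuracy of the model from the last iteration in the budget.

```python
def final_accuracy(accuracies):
    """First metric (as interpreted here): test accuracy of the model
    produced by the last tuning iteration in the budget."""
    return accuracies[-1]

def iterations_to_near_expert(accuracies, expert_accuracy, tolerance=0.02):
    """Second metric: 1-based index of the first tuning iteration whose
    test accuracy is within `tolerance` of the expert's accuracy,
    or None if that level is never reached within the budget."""
    target = expert_accuracy - tolerance
    for iteration, accuracy in enumerate(accuracies, start=1):
        if accuracy >= target:
            return iteration
    return None

# Hypothetical per-iteration test accuracies for one participant,
# on one task with the tool and one task without it.
with_tool = [0.60, 0.68, 0.735, 0.74]
without_tool = [0.60, 0.62, 0.66, 0.70]
expert = 0.75

reached_with = iterations_to_near_expert(with_tool, expert)
reached_without = iterations_to_near_expert(without_tool, expert)
```

Returning `None` rather than the budget size keeps tasks where the target was never reached distinguishable from tasks where it was reached on the last iteration.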

We further compare against the performance of automated
tuning techniques on the same tuning tasks. Two
autotuning implementations (one applying the Bayesian
techniques of Yamins, Cox, and Bergstra [18], the other
using random search of the parameter space) are run
on each user task, with the algorithmic search space
set to only those parameters users are asked to
optimize. For each implementation, we measure the number
of iterations to reach near-expert performance, as
defined above, and contrast the results with user
performance with and without ML-O-SCOPE. We argue
that the number of iterations is of greater importance
than the time per iteration because large scale pipeline
tuning time is typically dominated by the time it takes
to train each model revision, and this quantity is
independent of the tuning method. We expect that human
performance dominates algorithmic performance
measured in number of iterations.
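
For concreteness, the random-search baseline can be sketched as below. The search space values and the `toy_error` function are placeholders: in the study, each evaluation would train a model with the sampled settings and measure its CIFAR-10 test error, which is not reproduced here.

```python
import random

def random_search(evaluate, search_space, budget, seed=0):
    """Sample `budget` settings uniformly at random from the search
    space and keep the one with the lowest error."""
    rng = random.Random(seed)
    history = []
    for _ in range(budget):
        settings = {name: rng.choice(values)
                    for name, values in search_space.items()}
        history.append((settings, evaluate(settings)))
    best_settings, best_error = min(history, key=lambda item: item[1])
    return best_settings, best_error, history

# Hypothetical search space restricted to the tuned hyper-parameters.
search_space = {
    "filter_size": [3, 5, 7],
    "filter_count": [16, 32, 64],
}

def toy_error(settings):
    """Synthetic stand-in for training a model and measuring test error."""
    return (abs(settings["filter_size"] - 5) * 0.02
            + abs(settings["filter_count"] - 32) / 1000)

best_settings, best_error, history = random_search(
    toy_error, search_space, budget=10)
```

Because each entry in `history` corresponds to one model iteration, the same per-iteration bookkeeping used for human participants applies to this baseline.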

Beyond measuring the overall effectiveness of
ML-O-SCOPE, we would like to study the contributions of
individual system features and visualizations. To
determine these effects, participants are asked to complete a
brief survey after finishing all tuning tasks. The survey
asks users to explain why they made specific changes
to hyper-parameters during the tuning process, what
techniques they found successful, and where they had
difficulty. We expect this qualitative feedback to give
insight into the uses and usefulness of specific
visualizations.


7  Future  Directions 

To date, ML-O-SCOPE has been engineered for a specific
type of pipelined model, namely convolutional neural
networks for visual object classification. The principles
behind its design, however, are applicable to a
wider domain of both models and applications. We see
the most immediate promise from supporting more general
vision pipelines, for example, visualization support
for standard features like HOG and SIFT. These extensions
would enable diagnostics of a more open pipeline
design space not directly tied to the neural network
paradigm.

In addition, the system provides a solid platform to
explore other applications of visualization to pipelined
models. For example, implementing new pipeline operators
for feature extraction from image data is a difficult
undertaking, and visualization can help with development
and debugging of new code. Adapting ML-O-SCOPE
for code diagnostics could be a powerful extension
to the system.


8  Conclusion 

We have presented ML-O-SCOPE, a visualization tool
aimed at helping experts understand and diagnose issues
with convolutional neural networks. The tool allows
users to explore various aspects of structurally complex
pipelined models, from understanding the development
of convolutional structure to better understanding common
types of misclassification, and demonstrates the
applicability of visualization to the challenges of
optimizing complex object-recognition pipelines.